Assuming Equal vs. Unequal Prior Probabilities of Group Membership in
Discriminant Analysis: Effect on Predictive Accuracy 

ABSTRACT. Cross-validated classification accuracies were compared 
under assumptions of equal and varying degrees of unequal prior 
probabilities of group membership for 24 bootstrap and 48 simulated
data sets. The data sets varied in sample size, number of predictors, 
relative group size, and degree of group separation. Total-group hit 
rates were used to compare the relative accuracies across six 
assumptions about prior probabilities. Contrary to expectations, use of
population priors did not always yield the highest hit rate. When group 
sizes were similar, equal priors yielded greater classification accuracy 
than sample estimated priors. Results suggest that, when group sizes 
are similar, use of unequal priors may lead to a decrement in 
classification accuracy, even with knowledge of population priors. 

Theoretically, the assumption of unequal prior probabilities should lead to 
higher cross-validated classification accuracy as the difference between population 
group sizes increases. "Where we might tend to oversupply small groups and 
undersupply large ones by using resemblance as the sole basis for classification we 
introduce a corrective effect by taking prior probabilities of group membership into 
account" (Tatsuoka, 1988, p. 360). Consistent with this expectation, Rudolph and 
Karson (1988) found that estimated error rates using population priors were 
consistently lower than estimated error rates using equal priors. 

Although classification accuracy should increase with knowledge and use of 
population group sizes, these values are rarely known. Consequently, sample 
estimated values are generally used. However, the use of sample estimated values 
may be unwise. Huberty (1994, p. 65) argues that "priors should not correspond to 
the relative sample sizes unless ... a proportional sampling plan was utilized." Of 
course, proportional sampling presumes knowledge of population priors. Lindeman, 
Merenda, and Gold (1980, p. 211) point out that, "in most practical applications, the 
values of the prior probabilities are not known with sufficient accuracy to justify their 
use." Hence, these researchers have urged caution in using anything but equal prior 
probabilities of group membership for classification. 

The purpose of this study was to compare assumptions of equal versus varying 
degrees of unequal prior probabilities of group membership on cross-validated 
classification accuracy. The goal was to get some idea of the degree of difference in 
accuracy we might expect on application of these assumptions in practical 
classification problems. Implicit in this goal is the question of whether the increment
that may be afforded by assuming unequal priors is worth the risk when population
priors are unknown.






Method 

Cross-validated classification accuracies were compared under a variety of 
bootstrap and simulated data conditions (different sample sizes, predictor counts, 
relative group sizes, and prior probability assumptions) for the two-group 
classification problem. A total of 24 bootstrap and 48 simulated data sets were 
considered for each of six assumptions about prior probabilities of group membership: 

(1) sample n / sample N (Sample condition); 

(2) 1 / number of groups (Equal condition); 

(3) population n / population N (Pop+0 condition);

(4) group size for smaller group is 15% larger (Pop+.15 condition);

(5) group size for smaller group is 30% larger (Pop+.30 condition); and

(6) group size for smaller group is 45% larger (Pop+.45 condition).
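
The six conditions listed above can be sketched as a small function that returns a pair of priors for each condition. This is only an illustration under one reading of the erroneous conditions: the population size of the smaller group is inflated by 15%, 30%, or 45% and the two sizes are then renormalized. The function name, the argument names, and the tie-breaking choice for equal group sizes (group 2 is inflated) are assumptions, not details taken from the study.

```python
def prior_assumptions(sample_n, population_n, inflations=(0.15, 0.30, 0.45)):
    """Return {condition: (p1, p2)} for a two-group problem.

    sample_n and population_n are (n1, n2) pairs. The Pop+.xx conditions
    follow ONE possible reading of the text: inflate the smaller population
    group by the stated rate, then renormalize (an assumption).
    """
    def normalize(sizes):
        total = float(sum(sizes))
        return tuple(size / total for size in sizes)

    priors = {
        "Sample": normalize(sample_n),      # (1) sample n / sample N
        "Equal": (0.5, 0.5),                # (2) 1 / number of groups
        "Pop+0": normalize(population_n),   # (3) population n / population N
    }
    # Index of the smaller population group; ties go to group 2 (assumption).
    smaller = 0 if population_n[0] < population_n[1] else 1
    for rate in inflations:
        sizes = list(population_n)
        sizes[smaller] *= 1.0 + rate
        label = "Pop+" + f"{rate:.2f}".lstrip("0")   # e.g. "Pop+.15"
        priors[label] = normalize(sizes)
    return priors


if __name__ == "__main__":
    for name, pair in prior_assumptions((40, 60), (1000, 1000)).items():
        print(name, tuple(round(p, 4) for p in pair))
```

With equal population sizes, the Equal and Pop+0 conditions coincide, which matches the remark in the Results that these two assumptions yield identical results for data sets #1-24.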

The bootstrap data sets were obtained from 24 real data sets used in a prior 
classification methodology study (Morris & Huberty, 1987). No pathological 
distributional problems are known in any of the data sets; it is expected that they are 
much as one would find in typical classification studies. 

The 48 simulated populations were constructed according to multivariate 
normal models, with N ranging from 1270 to 2000. The group means and covariance
matrices needed for input to the population creation program were obtained from the
24 real data sets mentioned in the previous paragraph. For 24 populations, group
sizes were set to 1000. The remaining 24 populations were identical to these, except
that group sizes were proportional to the sample sizes found in the real data sets. 

The process for creating a population manifesting a specified covariance matrix 
is described in Morris (1975). The random normal deviates required by this method 
were created using the "Rectangle-Wedge-Tail" method (Marsaglia, MacLaren, &
Bray, 1964), with the required uniform random numbers generated by Park and
Miller’s (1988) "minimal standard" algorithm. A FORTRAN computer program
(modified for 64-bit word MS FORTRAN 5.0) provided by Dolker and Halperin was 
used for the variable generation. 
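
For readers unfamiliar with the "minimal standard" generator, the recurrence it is usually described by is a multiplicative congruential generator with multiplier 16807 and modulus 2^31 - 1. The Python sketch below illustrates that recurrence only; it is not the FORTRAN program used in the study, and the function names are hypothetical.

```python
MODULUS = 2**31 - 1   # 2147483647, a Mersenne prime
MULTIPLIER = 16807    # 7**5


def next_state(state):
    """Advance the generator one step: state <- 16807 * state mod (2^31 - 1)."""
    return (MULTIPLIER * state) % MODULUS


def nth_state(seed, n):
    """Return the integer state after n steps from the given seed."""
    state = seed
    for _ in range(n):
        state = next_state(state)
    return state


def uniform_stream(seed):
    """Yield uniform deviates in the open interval (0, 1)."""
    state = seed
    while True:
        state = next_state(state)
        yield state / MODULUS


if __name__ == "__main__":
    # Park and Miller's published check: seed 1, 10000 steps.
    print(nth_state(1, 10000))
```

Park and Miller give a check value for implementations: starting from a seed of 1, the state after 10,000 steps should be 1043618065.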

Classification rules for a randomly selected (with replacement) sample of the 
desired size were built with adjustments made for each of the six assumptions about 
prior probabilities. The adjusted classification rules were used to classify the entire 
population according to Tatsuoka’s (1988) minimum chi-square rule. This procedure
was repeated 1000 times for the 24 bootstrap data sets and 250 times for each of the 
48 simulated populations, and the mean number of total-group correct classifications 
was used to compare the relative accuracies of the six assumptions. 

In order to be more confident in the results of this simulation, and in accord 
with Knuth’s (1969, p. 156) recommendation that "the most prudent policy for a 
person to follow is to run each Monte Carlo program at least twice using quite 
different sources of random numbers, before taking the answers of the program 
seriously," the entire simulation was replicated. In the replication, Wichmann
and Hill’s (1982) uniform random number generator was used. This algorithm
generates uniform random numbers by a triple modulo method. As described in
Wilkinson (1987, p. 34 of the DATA module), "each uniform is constructed from
three multiplicative congruential generators with prime modulus," using 13579,
12345, and 313 as initial seeds. While there were some small differences in the
results of the replication, none were systematic, none were judged of importance, and 
the implications were the same. These replication results are available on request; the 
results presented in this paper are from the first random number generation method 
mentioned. 
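
The triple modulo method can be illustrated with the constants usually given for the Wichmann and Hill (1982) generator: multipliers 171, 172, and 170 with prime moduli 30269, 30307, and 30323. The sketch below is an illustration in Python rather than the code used in the replication; the function names are hypothetical, and only the three seeds are taken from the text.

```python
def wh_step(state):
    """Advance the three congruential generators and return (new_state, u)."""
    ix, iy, iz = state
    ix = (171 * ix) % 30269
    iy = (172 * iy) % 30307
    iz = (170 * iz) % 30323
    # Sum the three scaled values and keep the fractional part.
    u = (ix / 30269 + iy / 30307 + iz / 30323) % 1.0
    return (ix, iy, iz), u


def wh_stream(seed=(13579, 12345, 313)):
    """Yield uniform deviates in [0, 1), starting from the seeds in the text."""
    state = seed
    while True:
        state, u = wh_step(state)
        yield u


if __name__ == "__main__":
    stream = wh_stream()
    print([round(next(stream), 4) for _ in range(3)])
```

Combining three short-period generators gives a much longer overall period than any one of them alone, which is why this generator serves as a reasonably independent second source of random numbers for a replication.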



Results 

For each of the data sets, Tables 1, 2, and 3 give a short description, an index
of group separation (D²), the number of cases in each group (n₁ and n₂), the number of
predictor variables (p), and a comparison of the cross-validated classification
performance for each assumption about prior probabilities. Tables 2 and 3 also
include an index of disproportionality (I), calculated as (n_larger / n_smaller) × 100.

The best performing assumptions are underlined. The difference in performance 
between underlined and nonunderlined assumptions was considered statistically and 
practically significant based on subjectively established criteria (α = .00001 plus a
mean difference in hit rates of .002, which represents 4 hits for data sets with 2000
cases). The risk of a Type I error was actually much higher than .00001 due to the
large number of significance tests conducted. Although statistical significance was
considered less important than practical significance, an overall Hotelling T² test, and
then pairwise post hoc comparisons (multivariate analog of the Scheffé post hoc test;
see Morrison, 1976, pp. 147-148 for a description) were used to contrast the
classification hit rates for the six assumptions. 







Results of Simulation for Data Sets with Equal Group Sizes (#1-24) 

The Equal and Pop+0 assumptions, which yield identical results with equal 
group sizes, were expected to outperform the other four assumptions in all 24 data
sets. As indicated in Table 1, the Equal and Pop+0 conditions were top contenders 
in all but one data set (#15), and yielded the highest (though not always significantly 
higher) hit rates in 18 data sets (#5 - 8, 10 - 14, 16 - 24). Thus, these assumptions 
were the best performers most of the time rather than all of the time, which was 
somewhat contrary to expectations. 



Insert Table 1 About Here 



The Sample assumption was expected to perform less well than the Equal and 
Pop+0 assumptions due to sampling error inherent in the random sampling process, 
but was still expected to outperform the three erroneous assumptions (Pop+.15, 
Pop+.30, Pop+.45). Results were consistent with this expectation. The Sample 
assumption was a top contender in the same 23 data sets as the Equal and Pop+0
assumptions. Nevertheless, compared to the Equal and Pop+0 assumptions, the
Sample assumption yielded lower hit rates (though not significantly, based on α =
.00001) in 21 of the data sets (#4 - 24).

The rank order of the erroneous assumptions was expected to be Pop+.15,
Pop+.30, and Pop+.45 (i.e., from least to most discrepant with actual group sizes).
Results were consistent with this expectation. The Pop+.15 condition performed
better than the other two erroneous assumptions and worse than the Equal, Pop+0, 
and Sample assumptions. The Pop+.15 assumption was a top contender in 12 of the 
24 data sets, and was the best performer in two data sets (#9, 15). The Pop+.30 
assumption significantly outperformed the Pop+.45 assumption in 19 of the 24 data 
sets (#6 - 24). 

Results of Simulation for Data Sets with Group Sizes Proportional to Real Data Set 
Sizes (#25-48) 

The Pop+0 assumption was expected to outperform the other five assumptions 
in all 24 data sets. As shown in Table 2, the Pop+0 condition was a top contender in 
all but four data sets (#39 - 42), and yielded the highest (though not always 
significantly higher) hit rates in 11 data sets (#30 - 34, 38, and 43 - 47). No other 
assumption performed as well. Thus, although the Pop+0 was the best performer
overall, the results were somewhat contrary to expectations because this assumption 
did not yield the highest hit rate with every data set. 



Insert Table 2 About Here 



The Sample assumption was expected to perform less well than the Pop+0
assumption, again due to sampling error, but to outperform the four other
assumptions. Results were consistent with this expectation for the Pop+0, Pop+.15,
Pop+.30, and Pop+.45 assumptions. Compared to the Pop+0 assumption, the
Sample assumption yielded lower hit rates in 20 of the data sets (#27, 28, 30 - 47),
though this difference was statistically significant only for data set #45. The Sample
assumption was a top contender in 18 of the 19 data sets for which the Pop+0
assumption was also a top contender (#25 - 37, 39, 43, 44, and 46 - 48), and had the
highest hit rates (though not significantly higher) in 2 data sets (#26 and 48). None
of the erroneous assumptions matched this performance. Compared to the Pop+.15
condition, which was the best performing erroneous assumption, the Sample 
assumption yielded higher hit rates (though not always significantly higher) in 15 data 
sets (#26, 30 - 34, 36 - 38, 43 - 48). 

In the 20 data sets with unequal group sizes, the Equal assumption worked 
better than the Sample assumption only in data sets with small differences between 
group sizes (#27, 29, 35 - 37, 40 - 42, 44, 45). In the seven data sets with an index 
of disproportionality greater than 129 (#30 - 32, 34, 38, 43, 47), the Sample 
assumption outperformed the Equal assumption. The Sample assumption also 
outperformed the Equal assumption in three data sets with smaller differences between 
group sizes (#33, 46, 48). 

As with equal group sizes, the rank order of the erroneous assumptions with 
unequal group sizes was expected to be Pop+.15, Pop+.30, and Pop+.45 (i.e., from
least to most discrepant with actual group sizes). Results were consistent with this
expectation, parallel to the findings for equal group sizes. The Pop+.15 condition
performed better than the Pop+.30 and Pop+.45 assumptions and worse than the
Pop+0 and Sample assumptions. The Pop+.15 assumption was a top contender in
14 of the 24 data sets, was the best performer in five data sets (#29, 39 - 42), and
outperformed (though not always significantly) the Pop+.30 assumption in 20 data
sets (#29 - 48). The Pop+.30 assumption outperformed the Pop+.45 assumption,
though not always significantly, in 21 of the data sets (#26 and 29 - 48).

Results for Bootstrap Data Sets (Data Sets 49-72) 

Results for the bootstrap data sets were quite similar to results for the 
simulated data sets. As shown in Table 3, the Pop+0 condition was a top contender
in all data sets, and had the highest hit rate in 12 data sets (#54 - 56, 58, 61 - 62, and 
67 - 72). Nevertheless, other assumptions yielded higher hit rates (though not always 
significantly higher) in eight data sets (#51, 53, 57, 59, 60, 64 - 66). Thus, as with 
the simulated data sets, these results were somewhat contrary to expectations because 
the Pop+0 assumption did not yield the highest hit rate with every data set. 



Insert Table 3 About Here 



The Sample assumption was a top contender in all but one data set (#70), and 
had the second highest hit rate (behind Pop+0) in nine data sets (#54 - 56, 58, 61,
62, 67, 71, 72). Compared to the Pop+0 assumption, the Sample assumption yielded
lower hit rates in 22 of the data sets (#51 - 72), though this difference was statistically
significant only for data set #70. Compared to the Pop+.15 condition, the Sample 
assumption yielded higher hit rates (though not always significantly higher) in 14 data 
sets (#52, 54, 55, 56, 58, 61 - 63, 67 - 72). Thus, the performance of the Sample 
assumption relative to the erroneous assumptions matches what was found in the 
simulated data sets. 

The Equal assumption worked better than the Sample assumption only in data 
sets with similar group sizes (#51 - 53, 57, 59, 60, 63 - 66, 68 - 70). The Sample
assumption outperformed the Equal assumption in the seven data sets with indices of
disproportionality greater than 129 (#54 - 56, 58, 62, 67, 71), as well as in two data 
sets with more similar group sizes (#61, 72). The Pop+.15 condition performed 
better than the Pop+.30 and Pop+.45 assumptions and worse than the Pop+0 and 
Sample assumptions. The Pop+.15 assumption was a top contender in all but three 
data sets (#61, 63, 70), was the best performer in two data sets (#57, 65), and
outperformed (though not always significantly) the Pop+.30 assumption in 20 data 
sets (#52, 54 - 72). The Pop+.30 assumption outperformed the Pop+.45 
assumption, though not always significantly, in 21 of the data sets (#52 - 67). Again, 
these bootstrap results were similar to what was found in the simulated data sets. 
Discussion



Pop+0 vs. Other Assumptions 

Although the Pop+0 assumption was the best performer in an absolute sense,
its performance relative to the other five assumptions was not as good as predicted. 
The erroneous assumptions occasionally performed much better than would be 
expected based on their discrepancy with population sizes. For example, in some data 
sets with unequal population sizes, the erroneous assumption of equal priors yielded a 
higher hit rate than the correct assumption about population priors. 

At first glance, these results appear to be inconsistent with Rudolph and 
Karson’s (1988) finding of consistently lower error rate estimates using population 
priors rather than equal priors. This apparent inconsistency may be due to differences 
in relative population sizes between the two studies. In the Rudolph and Karson 
study, the population priors were .9 and .1, reflecting a large discrepancy in 
population sizes. In the present study, equal priors yielded a higher hit rate than 
population priors only in data sets with similar group sizes. In all data sets with
nontrivial differences in group sizes (I greater than 129), use of population priors
increased the hit rate over equal priors. 

Further support for this explanation of the apparent inconsistency between the 
two studies comes from a partial replication of the simulation. For data sets with 
similar group sizes (I less than or equal to 129) in which an erroneous assumption 
outperformed the Pop+0 assumption, new simulated data sets were created, each with 
900 1’s and 100 2’s. As in the Rudolph and Karson study, the Pop+0 assumption
outperformed the erroneous assumptions for every data set. These results are 
displayed in Table 4. Thus, our findings were consistent with Rudolph and Karson 
for data sets with dissimilar group sizes. 



Insert Table 4 About Here 



Still, it may seem counterintuitive that in any data set, erroneous assumptions 
about priors could yield higher hit rates than the correct assumption. An explanation 
for this is related to the differential effectiveness of the two classification rules when 
different priors are used. Suppose the two classification rules are equally effective in 
classifying 1’s and 2’s using equal priors. What happens when the rules are adjusted
for unequal priors? "For groups of unequal sizes that tend to reflect relative
population sizes (in an order sense), use of unequal priors will increase the hit rates
for the larger groups and decrease the hit rates for the smaller groups" (Huberty,
1994, p. 112). When the increment in hits for the larger group exceeds the
decrement in hits for the smaller group, the overall hit rate is higher. However,
when the decrement in hits for the smaller group exceeds the increment in hits for the 
larger group, the overall hit rate is lower. 

Consider data set #41 (Table 2), in which the correct (Pop+0) assumption 
about priors yields a lower hit rate than three of the incorrect assumptions (Equal, 
Pop+.15, Pop+.30). For this data set, Table 5 displays the average separate group
and total hits for each of the six assumptions about priors. We can see how changes 
in separate group hits affect the results. Compared to the Equal assumption, for 
example, the Pop+0 assumption averages 19 more hits for Group 1 but 24 fewer hits 
for Group 2. Consequently, there are fewer total hits for the correct Pop+0 
assumption than for the incorrect assumption of equal priors. 
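
The arithmetic behind this comparison can be checked directly from the Table 5 values. The short Python sketch below recomputes the change in separate-group and total hits between any two assumptions; the dictionary layout and function name are illustrative choices, while the numbers are the averages reported in Table 5.

```python
# Average hits for data set #41, as reported in Table 5.
HITS = {
    "Sample":  {"total": 1233.820, "group1": 633.240, "group2": 600.580},
    "Equal":   {"total": 1247.644, "group1": 616.400, "group2": 631.244},
    "Pop+0":   {"total": 1242.416, "group1": 635.124, "group2": 607.292},
    "Pop+.15": {"total": 1260.152, "group1": 526.100, "group2": 734.052},
    "Pop+.30": {"total": 1245.284, "group1": 415.496, "group2": 829.788},
    "Pop+.45": {"total": 1195.668, "group1": 303.612, "group2": 892.056},
}


def hit_change(reference, alternative):
    """Return alternative minus reference hits for each row of the table."""
    return {
        row: HITS[alternative][row] - HITS[reference][row]
        for row in ("total", "group1", "group2")
    }


if __name__ == "__main__":
    change = hit_change("Equal", "Pop+0")
    for row, value in change.items():
        print(f"{row}: {value:+.3f}")
```

Moving from Equal to Pop+0 gains about 18.7 hits in Group 1 and loses about 24.0 hits in Group 2, for a net loss of about 5.2 total hits, which is the pattern described above.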



Insert Table 5 About Here 



Sample-Estimated Priors vs. Equal Priors 

Relative to the assumption of equal priors, the assumption of sample-estimated 
priors mirrored the Pop+0 pattern. When group sizes were similar, the Equal 
assumption was generally superior. When group sizes differed by 13% or more, the
Sample assumption outperformed the Equal assumption. 

Huberty (1994, p. 65) contends that sample estimated priors are inappropriate 
unless proportional sampling has been used. Results from the current study suggest 
that, perhaps even with proportional sampling, use of population priors may lead to a 
decrement in classification accuracy when group sizes are similar. Additional study is 
needed to confirm this interpretation. 
Note. The best performing assumption(s) are underlined (p < .00001).



Cross Validated Classification Performance (Portion of "Hits") for Six Assumptions About Prior Probabilities for Simulated Data
Sets with 900 1’s and 100 2’s

Table 5

Average Separate Group and Total Hits for Six Assumptions about Priors for Data Set #41

                                        Assumptions

Average Hits    Sample      Equal       Pop+0       Pop+.15     Pop+.30     Pop+.45

Total           1233.820    1247.644    1242.416    1260.152    1245.284    1195.668
Group 1          633.240     616.400     635.124     526.100     415.496     303.612
Group 2          600.580     631.244     607.292     734.052     829.788     892.056